4. Main: Exploratory Data Analysis
4.1 Study 1: Time and Space study
Hypothesis 1: Intuitively, the number of accidents varies widely with the time of day (yes, even in the city that never sleeps).
- Total number of car accidents in NYC by weekdays and weekends
Let’s look at the number of car accidents in the city and see whether collisions on weekdays differ from those on weekends.
Before analyzing the data, my initial assumption was that more accidents would happen during rush hour on weekdays, which was roughly correct. On weekdays, the highest peaks occurred at around 7am to 9am and 4pm to 7pm. On weekends, accident counts were high in the afternoon, from around 1pm to 5pm. Since there is no commonly agreed weekend rush hour in NYC, it was hard to say whether this time range counts as one. However, I’d like to note that the weekend afternoon followed a trend similar to the weekday rush hour, especially the peaks between 1pm and 3pm and between 3pm and 5pm.
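The hourly comparison above can be sketched in a few lines of base R. This is only an illustration on made-up rows, not the code behind the plots; it assumes `Day.of.week` is coded 0 (Mon) to 6 (Sun), as the axis labels later in this report suggest, and that `hour` is a number from 0 to 23.

```r
# Toy data, made up for illustration; the real data has one row per collision.
# Assumptions: Day.of.week is coded 0 (Mon) to 6 (Sun), hour is 0 to 23.
toy <- data.frame(
  Day.of.week = c(0, 0, 1, 2, 4, 5, 5, 6, 6, 3),
  hour        = c(8, 17, 8, 17, 18, 14, 15, 14, 2, 8)
)

# Label each collision as weekday or weekend
toy$day_type <- ifelse(toy$Day.of.week >= 5, "Weekend", "Weekday")

# Cross-tabulate collisions by hour and day type
hourly <- table(hour = toy$hour, day_type = toy$day_type)

# Busiest hour for each day type
peak_hours <- apply(hourly, 2, function(counts) names(counts)[which.max(counts)])

print(hourly)
print(peak_hours)
```

On real data, the two columns of `hourly` are the two curves being compared; since there are five weekdays and only two weekend days, dividing each column by its number of days would make the heights comparable.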
Hypothesis 2: Trend of total collisions
Looking at each borough, we could tell that more accidents occurred in Brooklyn, Manhattan, and Queens than in the Bronx and Staten Island. Note that NA refers to records with a missing borough.
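For readers who want to reproduce the borough tally, including the NA group, here is a minimal sketch on made-up rows. The column name `BOROUGH` and the use of empty strings for missing boroughs are assumptions about how the file was read, not something shown in this report.

```r
# Toy data, made up for illustration. BOROUGH is an assumed column name;
# an empty string stands in for a missing borough.
toy <- data.frame(
  BOROUGH = c("BROOKLYN", "QUEENS", "", "MANHATTAN", "BROOKLYN", "", "BRONX"),
  stringsAsFactors = FALSE
)

# Turn empty strings into NA so missing boroughs are counted explicitly
toy$BOROUGH[toy$BOROUGH == ""] <- NA

# useNA = "ifany" keeps the NA group in the tally
borough_counts <- table(toy$BOROUGH, useNA = "ifany")
n_missing <- sum(is.na(toy$BOROUGH))

print(borough_counts)
print(n_missing)
```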
From here, I decided to look further into the data by geolocation during rush hour, which shows where the rush-hour accidents happened.
One thing to note is that many latitude and longitude values were missing, so not every accident in the dataset could be placed on the map. Further analysis of the missing values can be found in the data quality part of the analysis.
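A quick way to see how much of the data drops off the map is to compute the share of rows without coordinates. The sketch below uses made-up rows; `LATITUDE` and `LONGITUDE` are the column names used in the mapping code of this report.

```r
# Toy data, made up for illustration
toy <- data.frame(
  LATITUDE  = c(40.71, NA, 40.65, NA, 40.85),
  LONGITUDE = c(-74.00, NA, -73.95, -73.90, -73.88)
)

# A collision can be mapped only if both coordinates are present
mappable <- !is.na(toy$LATITUDE) & !is.na(toy$LONGITUDE)

# Share of collisions that cannot be placed on the map
share_missing <- mean(!mappable)
print(share_missing)

# Keep only the rows that can be plotted
toy_mapped <- toy[mappable, ]
```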
To go back to the analysis: first, I separated the data by the number of persons killed and injured and plotted them on the map.
For weekdays, I showed the locations of persons injured and killed during 7am ~ 9am and 4pm ~ 7pm.
It is clear that there were higher rates of injuries and deaths associated with the accidents in the evening than those in the morning.
For weekends, I showed the locations of persons injured and killed during 1pm ~ 5pm.
Unsurprisingly, both the number of people injured and the number killed were lower on weekends than on weekdays. From here, I wanted to explore whether these locations matched any congestion areas in NYC.
Hypothesis 3: Congestion areas and fatal collisions.
- Trend of fatalities during rush hour
Since the graphs above showed more deaths in the evenings, I focused on that period to test whether any locations matched the congestion areas during 4pm ~ 7pm.
Again, I excluded weekend accidents and the weekday morning rush hour because a) the purpose of this analysis was to identify whether congestion areas had any influence on fatal collisions and b) their numbers were not sufficient to test my hypothesis.
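The subset described here (fatal collisions on weekday evenings) could be selected roughly as follows. This is a sketch on made-up rows: `NUMBER.OF.PERSONS.KILLED` is an assumed column name, `Day.of.week` is assumed to run from 0 (Mon) to 6 (Sun), and I treat 4pm ~ 7pm as hours 16, 17 and 18.

```r
# Toy data, made up for illustration
toy <- data.frame(
  Day.of.week = c(0, 2, 4, 5, 1, 3),
  hour        = c(17, 8, 18, 17, 16, 19),
  NUMBER.OF.PERSONS.KILLED = c(1, 1, 0, 2, 1, 1)
)

is_weekday      <- toy$Day.of.week <= 4             # Mon to Fri
is_evening_rush <- toy$hour >= 16 & toy$hour < 19   # 4pm up to, not including, 7pm
is_fatal        <- toy$NUMBER.OF.PERSONS.KILLED > 0

# Fatal collisions during the weekday evening rush hour
evening_fatal <- toy[is_weekday & is_evening_rush & is_fatal, ]
print(evening_fatal)
```

Only these rows would then be plotted on the borough maps.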
- For weekdays, I showed the locations of persons killed during 4pm ~ 7pm
- A map of each borough
I zoomed in closer to see boroughs one by one, which would help me identify the congestion areas.
- Manhattan
Considering only the weekday evening rush hour, I could not draw a conclusion about the relationship between congestion areas and death counts from the information I had.
- Queens
The area near I-495, along the Horace Harding Expressway and the Long Island Expressway, had high death counts; these roads are among the worst traffic corridors.
- Brooklyn
Though fatal collisions were widespread, I was able to identify death counts in some areas known for traffic congestion, such as Bedford-Stuyvesant, Flatbush Avenue near Ditmas Park, and Sheepshead Bay Road.
- The Bronx
It was no surprise to see that the area near East 161st Street, by Yankee Stadium and the Bronx courthouses, had double death counts, since it is infamous for traffic problems including congestion and double parking. Many plans have been rolled out to help improve conditions on the busy road, such as launching the Bx6 Select Bus Service.
- Staten Island
Hylan Boulevard, Staten Island’s longest commercial roadway, serves as one of the borough’s primary roadways. Due to the nature and function of this corridor, Hylan Boulevard is frequently congested on weekdays, and it marked the highest death counts on the map.
4.2 Study 2: Impact of National Holidays on Traffic & Accidents
Summary: Holidays, yay! There is one more reason to cheer for them! The dates with the fewest accidents are holidays, when shops and other businesses are closed: New Year’s Day, National Day, and Christmas. One extreme outlier in the plot is Feb 29, which occurs only in a leap year (2016), so it has far fewer records than any other date.
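library(ggplot2)
# Line plot of accident counts (n) for each calendar date (doy)
ggplot(data = date_of_year, aes(x = doy, y = n, group = 1)) +
geom_line() +
# Label every 10th date so the axis stays readable
scale_x_discrete(breaks = function(x) x[seq(1, length(x), by = 10)]) +
labs(title = "Accidents Year Round Distribution", x = "Date", y = "Count of Accidents") +
# Annotate the dates with fewer than 2100 accidents
geom_text(aes(label = ifelse(n < 2100, as.character(doy), "")), nudge_x = -5, nudge_y = 0)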
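The Feb 29 outlier is easier to judge once each date is divided by the number of years in which it occurs. The sketch below uses made-up dates and assumes that the plotted count is a total across all years in the data, which this report does not state explicitly.

```r
# Toy collision dates, made up for illustration (two years, one of them a leap year)
dates <- as.Date(c(
  "2015-01-01", "2016-01-01",
  "2015-02-28", "2015-02-28", "2016-02-28", "2016-02-28",
  "2016-02-29", "2016-02-29"
))
doy  <- format(dates, "%m-%d")  # calendar date without the year
year <- format(dates, "%Y")

# Raw totals per calendar date
total <- tapply(doy, doy, length)

# Number of distinct years in which each calendar date occurs
years_seen <- tapply(year, doy, function(y) length(unique(y)))

# Average collisions per occurrence of the date
per_year <- total / years_seen

print(total)
print(per_year)
```

In the toy data, Feb 29 has half the raw total of Feb 28 but the same per-year average, which is the pattern one would expect if the real outlier is purely a leap-year effect.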
4.3 Study 3: Study of Alcohol Involvement
Hypothesis 1: Accidents involving alcohol are likely to occur on weekends
Finding: When we restricted the accident type to alcohol involvement, the difference between weekdays and weekends was pretty clear.
Our hypothesis was that alcohol consumption leads to more accidents on weekends than on weekdays, and the following plot supports it. We can also see accidents increasing as the weekend approaches: the count is relatively high on Friday, which is both the end of the work week and the beginning of the weekend.
# Keep collisions where any of the five contributing-factor columns reports alcohol
factor_cols <- paste0("CONTRIBUTING.FACTOR.VEHICLE.", 1:5)
alcohol <- df[rowSums(df[factor_cols] == "Alcohol Involvement", na.rm = TRUE) > 0, ]
# Day.of.week is coded 0 (Mon) to 6 (Sun); a factor keeps all seven days in order
ggplot(data = alcohol, aes(factor(Day.of.week, levels = 0:6))) +
geom_bar() +
ggtitle("Alcohol Involvement Accident Weekly Pattern") +
scale_x_discrete(name = "Day of week", drop = FALSE, labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
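Because a week has five weekdays and only two weekend days, a per-day average is a fairer check of this hypothesis than raw totals. Here is a small sketch on made-up rows; it uses only two of the five contributing-factor columns to stay short, and it assumes the 0 (Mon) to 6 (Sun) coding of `Day.of.week`.

```r
# Toy data, made up for illustration
toy <- data.frame(
  Day.of.week = c(0, 1, 2, 3, 4, 4, 5, 5, 6, 6, 6, 2),
  CONTRIBUTING.FACTOR.VEHICLE.1 = c(
    "Alcohol Involvement", "Unspecified", "Alcohol Involvement", "Unspecified",
    "Alcohol Involvement", "Unspecified", "Alcohol Involvement", "Unspecified",
    "Alcohol Involvement", "Alcohol Involvement", "Unspecified", "Unspecified"
  ),
  CONTRIBUTING.FACTOR.VEHICLE.2 = c(
    "", "", "", "", "", "Alcohol Involvement",
    "", "Alcohol Involvement", "", "", "", ""
  ),
  stringsAsFactors = FALSE
)

# A collision is alcohol-related if any factor column says so
factor_cols <- c("CONTRIBUTING.FACTOR.VEHICLE.1", "CONTRIBUTING.FACTOR.VEHICLE.2")
is_alcohol <- rowSums(toy[factor_cols] == "Alcohol Involvement", na.rm = TRUE) > 0
alcohol <- toy[is_alcohol, ]

# Average number of alcohol-related collisions per day of each type
is_weekend <- alcohol$Day.of.week >= 5
per_day <- c(Weekday = sum(!is_weekend) / 5, Weekend = sum(is_weekend) / 2)
print(per_day)
```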
Hypothesis 2: Accidents involving alcohol are likely to occur late at night
Finding: Accidents increased from the evening on and peaked late at night.
Without any filter on the contributing factor, the distribution of accidents showed a skewed bell curve with rush-hour peaks, as shown in the previous study. Once we narrowed down to alcohol-related accidents, it showed an entirely opposite trend. As we can see from the following plot, alcohol-involved cases bottomed out during the day and climbed gradually from 3pm to midnight. Late night (from 12am to 5am) is the peak for accidents with alcohol involvement, which suggests a link between late-night drinking and these accidents.
# Keep collisions where any of the five contributing-factor columns reports alcohol
factor_cols <- paste0("CONTRIBUTING.FACTOR.VEHICLE.", 1:5)
alcohol <- df[rowSums(df[factor_cols] == "Alcohol Involvement", na.rm = TRUE) > 0, ]
ggplot(data = alcohol, aes(hour)) +
geom_bar() +
ggtitle("Alcohol Involvement Accident Hourly Pattern")
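To put a number on the "opposite trend", one can compare the share of collisions that fall in the late-night window for all accidents and for alcohol-related ones. The sketch below is on made-up rows; the logical `alcohol` column is a hypothetical stand-in for the contributing-factor filter, and `hour` is assumed to be a number from 0 to 23.

```r
# Toy data, made up for illustration
toy <- data.frame(
  hour    = c(1, 2, 3, 8, 9, 13, 17, 18, 23, 0),
  alcohol = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Late night is taken as 12am up to, not including, 5am
late_night <- toy$hour >= 0 & toy$hour < 5

# Share of collisions in the late-night window
share_all     <- mean(late_night)               # all collisions
share_alcohol <- mean(late_night[toy$alcohol])  # alcohol-related only

print(c(all = share_all, alcohol = share_alcohol))
```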
Hypothesis 3: Accidents involving alcohol will be clustered in Midtown-Downtown Manhattan
Finding: From the previous hypotheses, we sketched the profile of alcohol-involved accidents: they are concentrated on weekends and around midnight. Now we will focus on where these accidents are located.
library(ggmap)
# Collisions where any contributing-factor column reports alcohol
# (note: no day-of-week or hour filter is applied to this subset)
factor_cols <- paste0("CONTRIBUTING.FACTOR.VEHICLE.", 1:5)
alcohol <- df[rowSums(df[factor_cols] == "Alcohol Involvement", na.rm = TRUE) > 0, ]
ManhattanMap <- qmap(location = "Manhattan", zoom = 11, color = "bw")
ManhattanMap +
geom_point(aes(x = LONGITUDE, y = LATITUDE), color = "gold", alpha = 0.1, data = alcohol) +
ggtitle("Alcohol Involvement on weekends and during midnight") +
labs(x = "Longitude", y = "Latitude")
4.4 Study 4: Effect of Climate!
We as New Yorkians (this sounds way cooler, right? Never really liked Yorkers) are so flustered by the weather. We fall in love with it, go outside, and start hating it the next moment! Weather is such an integral part of human society that it is the topmost factor to consider, always. So, let’s see how weather has an effect, correlation of course (do I dare say causation?), on accidents! (The effect of weather on pedestrians comes later.) Don’t worry, you won’t be bored. I will keep you interested with my humour, or lack thereof.
Hypothesis 1: Temperature
Finding: Looks like our dear temperature has not much correlation with accidents. Read on to find out more …
There does not seem to be much effect of temperature on accidents, injuries or deaths. Just to be sure, let’s look at the density plots.
Hypothesis 2: Snowfall
Finding: Snowfall has a definite correlation with the number of accidents and the number of injuries!
Ah dear Lord Snow, you know nothing and are always butting into things! (Get it? No??? Watch Game of Thrones now!) Looks like the higher the snowfall, the higher the number of accidents and injuries! Let’s confirm this by comparing density plots. Lord Snow is compassionate enough to come only a few times a year, hence not many deaths on the plot (data insufficient!).
(Bored already? Some humour: I call the above plot the “Manhattan Graph”. Quick, spot the World Trade Center!!)
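For the number-minded, the eyeballed "no correlation" for temperature and "definite correlation" for snowfall can be checked with `cor()` on a daily table. This is a sketch on five made-up days, chosen so the contrast is obvious; the data frame `daily` and its columns `temp_f`, `snow_in` and `accidents` are hypothetical names, not the ones used for the plots.

```r
# Toy daily table, made up for illustration (hypothetical column names)
daily <- data.frame(
  temp_f    = c(35, 28, 25, 30, 47),       # daily temperature
  snow_in   = c(0, 2, 0, 2, 0),            # daily snowfall
  accidents = c(600, 610, 590, 610, 600)   # collisions that day
)

# Pearson correlation of each weather variable with the daily accident count
cor_temp <- cor(daily$temp_f, daily$accidents)
cor_snow <- cor(daily$snow_in, daily$accidents)

print(c(temperature = cor_temp, snowfall = cor_snow))
```

A correlation near zero matches the temperature finding and a clearly positive one matches the snowfall finding; neither says anything about causation, of course.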
Hypothesis 3: Rain! Rain! Rain!
Finding: Like his sister Snow, Rain has a correlation with accidents, injuries and even deaths. Read on …
Can’t make out anything from this! But I was so sure that rain makes driving and walking difficult, and hence was hoping it would result in more accidents! (Beware: sadistic tendencies :/ ) Now, let’s see whether my favorite density plots will unveil some abracadabra!
Hypothesis 4: Wind and Fog
Finding: Contrary to my belief, wind and fog don’t affect accidents much. Now, believe me when I say we have enough data! “Let’s speed on to density plots, I say”
4.5 Study 5: What impacts accidents involving pedestrians?
Now we come to, in my opinion, the most impactful part of this analysis. In this world, we have to strive to make life easier for people, and a main part of that is that no one should be physically hurt, at the least! Thanks to technological advancements in car safety features, accident-related injuries have drastically reduced since the early 90’s. However, we homo sapiens have been awaiting an upgrade in body structure (BS, I tell you) from our beloved Lord Thor for >5000 years now. So, if we want to become more green and start walking (or taking public transport), we have to start wearing helmets while walking (haha), or carefully follow whatever I say below (I mostly say, “Stay at home, grab a cup of coffee and a novel, the surest way of being peaceful in life”)!
Hypothesis 1: Pedestrians are more likely to be hit at night, when it is difficult to spot people.
Finding: It is indeed! Of special concern are the pedestrian deaths, which seem to increase disproportionately at night.
You see the shifting peaks? It is highly likely that at night it is very difficult to spot people on the road, and by the time you spot someone, it is already too late! So many deaths at night! Be careful, people, especially when the sun goes down. Follow the traffic rules and be safe!
Hypothesis 2: A side study, inspired by the above. Injuries and accidents should be more rampant at night too!
Finding: Oh surely! Look at the disproportionate deaths late at night! A likely explanation is that people cannot concentrate from 12AM-4AM (sleep cycle!) and reactions slow dramatically then. This might lead to more “severe” accidents.
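One way to make "disproportionate" concrete is to compare deaths per collision in the 12AM-4AM window with the rest of the day. The sketch below is on made-up rows; `NUMBER.OF.PERSONS.KILLED` is an assumed column name and `hour` is assumed to be a number from 0 to 23.

```r
# Toy data, made up for illustration
toy <- data.frame(
  hour = c(1, 2, 3, 9, 10, 14, 15, 17, 18, 18),
  NUMBER.OF.PERSONS.KILLED = c(1, 0, 1, 0, 0, 0, 0, 1, 0, 0)
)

# Split the day into the late-night window and everything else
toy$period <- ifelse(toy$hour < 4, "12AM-4AM", "Rest of day")

# Collisions and deaths per period
collisions <- tapply(toy$NUMBER.OF.PERSONS.KILLED, toy$period, length)
deaths     <- tapply(toy$NUMBER.OF.PERSONS.KILLED, toy$period, sum)

# Deaths per collision: a rough severity measure
deaths_per_collision <- deaths / collisions
print(deaths_per_collision)
```

If the late-night ratio is clearly higher on the real data, the deaths are disproportionate in the sense used above: there are fewer collisions at that time, but each one is more likely to be fatal.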
Finding: Turns out low temperatures and heavy snow correlate highly with pedestrian accidents, injuries and deaths. Rain and fog, not so much.
1) Temperature and Pedestrians
2) Snow and Pedestrians
Snow and pedestrians, a better love story than Twilight :D
3) Rain and Pedestrians
Not much of a correlation